Philadelphia Housing Model—Technical Appendix

Predicting Residential Property Prices

Author

Ming Cao, Mark Deng, Jun Luu, Josh Rigsby, Alex Stauffer, Tess Vu

Published

October 28, 2025

Data Cleaning

Primary Philadelphia Sales Data (OpenDataPhilly)

# Import relevant libraries.
library(car)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(scales)
library(sf)
library(stringr)
library(tibble)
library(tidycensus)
library(tidyr)
library(tidyverse)
library(tigris)
library(tmap)
library(units)

options(scipen = 999)
# Property data.
properties_path <- "data/philly_properties.csv"
properties <- read.csv(properties_path)

# Capture dimensions.
og_property_dimension <- dim(properties)

# Set Census API key.
census_api_key("3aaee31789e10b674a531e9f236c35d5394b19ed")
# All variables are character strings, remove white space then convert numeric character variables to numeric classes for chosen variables.

properties <- properties %>%
  mutate(across(where(is.character), str_trim),
         across(c(fireplaces, garage_spaces, number_of_bathrooms, number_stories,
                  sale_price, total_livable_area, year_built), as.numeric)) %>%
  rename(fireplace_num = fireplaces,
         garage_num = garage_spaces,
         bath_num = number_of_bathrooms,
         story_num = number_stories,
         square_feet = total_livable_area,
         basement_type = basements,
         ac_binary = central_air,
         fuel_type = fuel,
         )
# Filter to residential properties and 2023-2024 sales.
# Note: Category code #1 is for residential.
residential_prop <- properties %>%
  filter(.,
         category_code == 1,
         startsWith(sale_date, "2023") | startsWith(sale_date, "2024"))

# Drop empty variables or variables not needed for model.
residential_prop <- residential_prop %>%
  select(c(basement_type, ac_binary, fireplace_num, fuel_type, garage_num,
           bath_num, story_num, sale_date, sale_price,
           square_feet, year_built, shape)
         )

# Make empty character column values NA.
residential_prop <- residential_prop %>%
  mutate(across(where(is.character), ~na_if(., "")))
# Drop prices that are less than $10,000 as a catch-all (might not be as reflective for rural areas). Avoiding dropping prices based on percent of assessed value since property assessments can be biased against minoritized communities. Ideal drop would add deed type to drop any family or forced transfers.
residential_prop <- residential_prop %>%
  filter(sale_price > 10000,
         square_feet > 0) %>%
  mutate(.,
         price_per_sqft = sale_price / square_feet) %>%
  filter(.,
         price_per_sqft > quantile(price_per_sqft, 0.05, na.rm = TRUE))
# Create building age column, change central air to binary, and adjust basement and fuel types.
# Create log value for the sale price.
residential_prop <- residential_prop %>%
  mutate(ln_sale_price = log(sale_price),
         age = 2025 - year_built,
         ln_square_feet = log(square_feet),
         ac_binary = case_when(
           ac_binary == "Y" ~ 1,
           ac_binary == "N" ~ 0),
         basement_type = case_when(
           basement_type == "0" ~ "None",
           basement_type == "A" ~ "Full Finished",
           basement_type == "B" ~ "Full Semi-Finished",
           basement_type == "C" ~ "Full Unfinished",
           basement_type == "D" ~ "Full Unknown Finish",
           basement_type == "E" ~ "Partial Finished",
           basement_type == "F" ~ "Partial Semi-Finished",
           basement_type == "G" ~ "Partial Unfinished",
           basement_type == "H" ~ "Partial Unknown Finish",
           basement_type == "I" ~ "Unknown Size Finished",
           basement_type == "J" ~ "Unknown Size Unfinished"),
         fuel_type = case_when(
           fuel_type == "A" ~ "Natural Gas",
           fuel_type == "B" ~ "Oil Heat",
           fuel_type == "C" ~ "Electric",
           fuel_type == "D" ~ "Coal",
           fuel_type == "E" ~ "Solar",
           fuel_type == "F" ~ "Wood",
           fuel_type == "G" ~ "Other",
           fuel_type == "H" ~ "None")
         )
# Turn categorical data into factors so OLS regression doesn't handle the data as a list of strings.
residential_prop$basement_type <- as.factor(residential_prop$basement_type)
residential_prop$fuel_type <- as.factor(residential_prop$fuel_type)

# Change the reference categories for baseline comparison.
residential_prop$basement_type <- relevel(residential_prop$basement_type, ref = "None")
residential_prop$fuel_type <- relevel(residential_prop$fuel_type, ref = "Natural Gas")

# Place fuel type with 10 or less counts into other category.
residential_prop <- residential_prop %>%
  mutate(fuel_type = fct_lump_min(fuel_type, min = 11, other_level = "Other"))
# Fixed effect temporal market fluctuations. Based on sale date, splitting the years into quarters (Q1, Q2, Q3, Q4). Potential fixed effect.
residential_prop <- residential_prop %>%
  mutate(
    quarters_fe = quarter(as_datetime(sale_date))
    )

# Make it a factor.
residential_prop$quarters_fe <- factor(residential_prop$quarters_fe)
# Capture dimensions.
after_property_dimension <- dim(residential_prop)

# Convert residential property to geodataframe. Use EPSG 2272 for South Pennsylvania in feet.
# Drop shape when finished creating geometry.
residential_prop_gdf <- residential_prop %>%
  mutate(geometry = st_as_sfc(shape)) %>%
  st_as_sf(crs = 2272) %>%
  rename(geometry_point = geometry) %>%
  select(-c(shape))

Spatial Data and Feature Engineering

Spatial Data (OpenDataPhilly)

# Read in Philadelpha census tracts.
philly_tracts_path <- "data/philly_tracts/philly_tracts.shp"
philly_tracts <- st_read(philly_tracts_path)
Reading layer `philly_tracts' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\philly_tracts\philly_tracts.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3446 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51985 ymin: 39.7198 xmax: -74.68956 ymax: 42.51607
Geodetic CRS:  NAD83
# Match CRS.
philly_tracts <- st_transform(philly_tracts, crs = 2272)

# Left spatial join.
residential_points <- st_join(residential_prop_gdf, philly_tracts)

# Drop unnecessary columns and remove incomplete observations (rows) for upcoming spatial feature computations.
residential_points <- residential_points %>%
  select(-c(FUNCSTAT, MTFCC, NAMELSAD, NAME,
            STATEFP, COUNTYFP, TRACTCE)) %>%
  na.omit(.)

# Remove unneeded datasets for housekeeping and call garbage collector to reduce memory.
rm(properties, residential_prop, residential_prop_gdf)
gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  8656939 462.4   14727822  786.6  12442875  664.6
Vcells 73514718 560.9  139968249 1067.9 139967963 1067.9
# Proximity to downtown.

# Decided on Euclidean distance because network proximity computation is demanding on thousands of points, even with parallel programming.

# Create single Center City point feature based on City Hall coordinates.
center_city <- st_sfc(st_point(c(-75.163500, 39.952800)), crs = 4326) %>%
  st_transform(crs = 2272)

# Need to add mile units for operations. Then remove units object for easier plotting.
residential_points$city_dist_mi <- (st_distance(residential_points, st_union(center_city))) %>%
  set_units("mi") %>%
  drop_units() %>%
  as.numeric()

# Log transform because distance benefit diminishes, for potential use.
residential_points$ln_city_dist <- log(residential_points$city_dist_mi + 0.1)
# Transit proximity.
# Major cities could be distance to nearest transit like metro/rail stations, but suburban and rural areas might be better served by distance to nearest major highway.
# Read in SEPTA stops.
septa_stops_path <- "data/septa_stops.csv"

septa_stops_df <- read.csv(septa_stops_path)

# Make csv a geodataframe.
septa_stops <- septa_stops_df %>%
  st_as_sf(., coords = c("Lon", "Lat"), crs = 4326)

# Match CRS.
septa_stops <- septa_stops %>%
  st_transform(., crs = 2272)

# Stops are duplicated for the same station because the data includes directions for all cardinal directions as well as bus, rail, and trolley for the same location. This means a single station could have more than one point representing a single location residents go to commute.
# Create new column with stop name without the cardinal suffixes and keep only the unique station values.
septa_stops <- septa_stops %>%
  mutate(stations = if_else(
    str_detect(StopAbbr, "NO$|SO$|EA$|WE$|NE$|NW$|SE$|SW$"),
    str_sub(StopAbbr, end = -3),
    str_sub(StopAbbr)
    )
  ) %>%
  distinct(stations, .keep_all = TRUE)

# Create buffer zone for stops within a half mile. This is ~10 minute walk, depending on topography.
# Note: EPSG 2272 is measured in feet, not miles.
septa_distance <- st_buffer(residential_points, 2640)

# Create number of stops in the buffer zone.
septa_stations <- st_intersects(septa_distance, septa_stops)

# Append buffer zone counts and put into main tract data. Create a logged version for potential use as well because distance benefit tapers off.
residential_points <- residential_points %>%
  mutate(
    septa_half_mi = lengths(septa_stations),
    ln_septa_half_mi = log(septa_half_mi + 0.1)
  )
# Park proximity / size. Measuring distance is important for accessibility, but the size of the park often matters because a property near a block-sized pocket of green space is not equivalent to being near a large one like Wissahickon Valley Park.

# Read in geojson data.
parks_path <- "data/parks.geojson"

parks <- st_read(parks_path)
Reading layer `parks' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\parks.geojson' 
  using driver `GeoJSON'
Simple feature collection with 63 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -75.2837 ymin: 39.87048 xmax: -74.95865 ymax: 40.13191
Geodetic CRS:  WGS 84
# Match CRS and filter by parks.
parks <- parks %>%
  st_transform(., crs = 2272) %>%
  filter(str_detect(USE_, "PARK"))

# Get distance to the edge of the nearest park.
# Note: Don't try to do spatial operations in apply() and mutate().
# Distance matrix of residential properties to parks.
parks_matrix <- st_distance(residential_points, parks)

# Get the nearest distance for each point.
residential_points$parks_mile <- apply(parks_matrix, 1, min)

# Convert to miles.
residential_points$parks_mile <- as.numeric(residential_points$parks_mile) / 5280

# Log parks data for potential use because of diminishing distance benefits.
residential_points$ln_park_dist <- as.numeric(log(residential_points$parks_mile + 0.1))
# Convenience/Food points of interest. Using kNN to measure the density of these amenities rather than nearest amenity point.
amenities_path <- "data/osm_pois/osm_pois.shp"
amenities <- st_read(amenities_path)
Reading layer `osm_pois' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\osm_pois\osm_pois.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 65127 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.52111 ymin: 39.71816 xmax: -74.69473 ymax: 42.25797
Geodetic CRS:  WGS 84
# Filter amenities by convenience and food.
amenities <- amenities %>%
  filter(fclass %in% c(
    "atm", "bakery", "bank", "bar", "beauty_shop", "biergarten", "bookshop",
    "butcher", "cafe", "convenience", "department_store", "fast_food", "food_court",
    "greengrocer", "hairdresser", "kiosk", "laundry", "market_place", "pharmacy",
    "mall", "pub", "restaurant", "supermarket"
  )
         ) %>%
  st_transform(., crs = 2272)

# Distance matrix of residential properties to amenities.
amenities_matrix <- st_distance(residential_points, amenities)
# k-Nearest Neighbors (kNN) function.
knn_distance <- function(distance_matrix, k) {
  apply(distance_matrix, 1, function(distances){
    mean(as.numeric(sort(distances)[1:k]))
  })
}

# Create kNN feature for amenities. k = 4 to balance for urban and suburban areas, probably not as representative of rural areas.
residential_points <- residential_points %>%
  mutate(
    knn_amenity_mi = as.numeric(knn_distance(amenities_matrix, k = 4))
  )

# Convert to miles.
residential_points$knn_amenity_mi <- as.numeric(residential_points$knn_amenity_mi / 5280)
# Fixed effect neighborhoods.
neighborhoods_path <- "data/philadelphia_neighborhoods/philadelphia_neighborhoods.shp"
philly_neighborhoods <- st_read(neighborhoods_path)
Reading layer `philadelphia_neighborhoods' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\philadelphia_neighborhoods\philadelphia_neighborhoods.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 159 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28026 ymin: 39.86701 xmax: -74.95576 ymax: 40.13799
Geodetic CRS:  WGS 84
# Match CRS.
philly_neighborhoods <- philly_neighborhoods %>%
  st_transform(., crs = 2272)

# Join to residential points and rename to neighborhoods.
residential_points <- residential_points %>%
  st_join(., philly_neighborhoods) %>%
  rename(neighborhood_fe = MAPNAME)

# Make the neighborhoods a factor.
residential_points$neighborhood_fe <- relevel(factor(residential_points$neighborhood_fe), ref = "East Falls")

# Place neighborhoods with 10 or less sales into a small neighborhoods category.
residential_points <- residential_points %>%
  mutate(neighborhood_fe = fct_lump_min(neighborhood_fe, min = 11, other_level = "Small Neighborhoods"))
# Capture spatial feature dimensions.
after_feat_eng_dimension <- dim(residential_points)
# Spatial feature creation table, select spatial features into a separate data frame and drop geometry.
spatial_feature_df <- residential_points %>%
  select(c(city_dist_mi, ln_city_dist, septa_half_mi, ln_septa_half_mi,
           parks_mile, ln_park_dist, knn_amenity_mi)) %>%
  na.omit(.) %>%
  st_drop_geometry()

# Create a tibble from the selected spatial features.
spatial_summary <- tibble(
  "Spatial Features" = names(spatial_feature_df),
  "Description" = c("Distance from city (mi).", "Log of distance from city.", "Within 0.5mi of SEPTA station.",
                    "Log of 0.5 SEPTA station.", "Distance from nearest park (mi).",
                    "Log of distance from nearest park.", "k-Nearest Neighbors convenience and food amenities.")
)

# Make Kable of spatial features.
spatial_kable <- kable(spatial_summary,
                       caption = "Feature Engineered Variables",
                       format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "5cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100", bold = TRUE)

spatial_kable
Feature Engineered Variables
Spatial Features Description
city_dist_mi Distance from city (mi).
ln_city_dist Log of distance from city.
septa_half_mi Within 0.5mi of SEPTA station.
ln_septa_half_mi Log of 0.5 SEPTA station.
parks_mile Distance from nearest park (mi).
ln_park_dist Log of distance from nearest park.
knn_amenity_mi k-Nearest Neighbors convenience and food amenities.

Primary:

From the CSV, we are analyzing the conditions of basements, number of fireplaces, garage spaces, number of bathrooms, number of stories, the total livable area in square feet, the existence of central air, and type of fuel used on the property. We filtered residential category code with the sale dates between 2023 and 2024. We eliminated property prices <10k. Rather than adhering to the percentage of assessed value as a guide for this filter, for it could incorporate marginalized bias, filtering the property prices removes non-market transactions but still incorporates a wide diversity of communities.

The forced the central air characteristic to become binary rather than “Y” and “N” and made sure to turn the categorical data to factors. The reference categories for types of basements is “None” and for fuel type, it’s “Natural Gas”. We including a building age category computed from the year built data. We logged the square footage and the sale price to correct for right-skewedness.

Spatial:

We inserted the Philadelphia census tracts, changing the CRS to 2272, the ideal projection for SE Pennsylvania analysis. We decided to perform a log transformation on the following variables - city_dist (distance from City Hall in Center City), septa_half_mi (half mile buffer zone from all septa stops within the city geometry), and parks_dist (distance to edge of nearest park in miles) - because their effects on housing prices were non-linear. This transformation ensures that changes in proximity are measured more consistently across the range of distances, rather than being dominated by properties very close to these features.

Regarding amenities, we used k-NN (k nearest neighbors) to measure the density of amenity accessibility rather than individual point data. The amenities are as follows: ATM, bakery, bank, bar, beauty shop, biergarten, bookshop, butcher, café, convenience, department store, fast food, food court, greengrocer, hairdresser, kiosk, laundry, marketplace, pharmacy, mall, pub, restaurant, supermarket. We filtered by convenience and food, transformed the CRS to 2272 for consistency, and then developed a matrix. The distance was inverted, log-transformed to account for diminishing returns, and scaled it to produce a single numeric value in which higher, positive values indicate greater accessibility to amenities.

We included neighborhoods as fixed effects to help explain unknown, unquantifiable factors like cultural atmosphere and other neighborhood-specific factors we cannot statistically account for that may influence housing prices. It was converted into a factor so each neighborhood can receive its own baseline model. Fiscal quarters were also introduced as fixed effects; splitting a year into 4 quarters for unknown factors when it comes to purchasing property (e.g. people are more likely to buy real estate in spring and summer).

Census Data (TidyCensus)

# Open tidycensus data. Using 2023 data, because we are looking at sales 2023-2024
acs_vars <- load_variables(2023, "acs5", cache = TRUE)

# Get acs dimensions.
og_acs_dimension <- dim(acs_vars)

# The variables that we want from tidycensus
variables <- c(
  median_household_income = "B19013_001",
  total_pop = "B01003_001",
  poverty_white = "B17001A_001", # To get poverty percentage
  poverty_black = "B17001B_001",
  poverty_native = "B17001C_001",
  poverty_asian = "B17001D_001",
  poverty_islander = "B17001E_001",
  poverty_other = "B17001F_001",
  poverty_multiracial = "B17001G_001",
  male_18_24_bach = "B15001_009", # Tracts only show bachelor's degrees, unless we want to look at only people 25+
  male_25_34_bach = "B15001_017",
  male_35_44_bach = "B15001_025",
  male_45_64_bach = "B15001_033",
  male_65plus_bach = "B15001_041",
  female_18_24_bach = "B15001_050",
  female_25_34_bach = "B15001_058",
  female_35_44_bach = "B15001_066",
  female_45_64_bach = "B15001_074",
  female_65plus_bach = "B15001_082",
  total_vacant = "B25005_001", # To get vacancy percentage
  white_total_units = "B25032A_001", # Need total units to get percentage of single, detached units and vacant units.
  white_single_family = "B25032A_002",
  black_total_units = "B25032B_001",
  black_single_family = "B25032B_002",
  native_total_units = "B25032C_001",
  native_single_family = "B25032C_002",
  asian_total_units = "B25032D_001",
  asian_single_family = "B25032D_002",
  islander_total_units = "B25032E_001",
  islander_single_family = "B25032E_002",
  other_total_units = "B25032F_001",
  other_single_family = "B25032F_002",
  multiracial_total_units = "B25032G_001",
  multiracial_single_family = "B25032G_002",
  medhhinc_white = "B19013A_001", # Median Household Income
  medhhinc_black = "B19013B_001",
  medhhinc_native = "B19013C_001", 
  medhhinc_asian = "B19013D_001", 
  medhhinc_other = "B19013F_001",  # There is no tract data for native hawiian/pacific islander, I'm including it with other
  medhhinc_multiracial = "B19013G_001", 
  white_pop = "B01001A_001",
  black_pop = "B01001B_001",
  native_pop = "B01001C_001",
  asian_pop = "B01001D_001",
  islander_pop = "B01001E_001",
  other_pop = "B01001F_001",
  multiracial_pop = "B01001G_001"
)

# We are grouping our data by tracts
philly_tract_acs <- get_acs(
  geography = "tract",
  state = "PA",
  variables = variables,
  year = 2022,
  survey = "acs5",
  cache_table = TRUE, 
  output = "wide"
)
# Summing up the variables that we need to create our percentage variables
philly_tract_acs <- philly_tract_acs %>%
  mutate(
    total_poverty = poverty_whiteE + poverty_blackE + poverty_nativeE + poverty_asianE + poverty_islanderE + poverty_otherE + poverty_multiracialE, # Adding all poverty populations together 
    
    total_bach = male_18_24_bachE + male_25_34_bachE + male_35_44_bachE + male_45_64_bachE + male_65plus_bachE + female_18_24_bachE + female_25_34_bachE + female_35_44_bachE + female_45_64_bachE + female_65plus_bachE, #Adding all bachelors degrees together
    
    total_units = white_total_unitsE + black_total_unitsE + native_total_unitsE + asian_total_unitsE + islander_total_unitsE + other_total_unitsE + multiracial_total_unitsE, # Total housing units
    
    total_single_family = white_single_familyE + black_single_familyE + native_single_familyE + asian_single_familyE + islander_single_familyE + other_single_familyE + multiracial_single_familyE # Total single family homes
  )
# Creating our variables that we are going to analyze
philly_tract_acs <- philly_tract_acs %>%
  mutate(
    pct_poverty = (total_poverty/total_popE)*100, # Divide total poverty population by total population

    pct_bach = (total_bach/total_popE)*100, # Divide bachelor degree holders by total population
    
    pct_vacant = (total_vacantE/total_units)*100, # Divide vacant units by total housing units
    pct_vacant = ifelse(is.infinite(pct_vacant) | total_vacantE > total_units, 100, pct_vacant), # Fixing errors when units equal zero or high MOE
    
    pct_single_family = (total_single_family/total_units)*100, # Divide single family homes by total housing units
    
    medhhinc = 
  (ifelse(is.na(medhhinc_whiteE), 0, medhhinc_whiteE) * white_popE +
   ifelse(is.na(medhhinc_blackE), 0, medhhinc_blackE) * black_popE +
   ifelse(is.na(medhhinc_nativeE), 0, medhhinc_nativeE) * native_popE +
   ifelse(is.na(medhhinc_asianE), 0, medhhinc_asianE) * asian_popE +
   ifelse(is.na(medhhinc_otherE), 0, medhhinc_otherE) * (islander_popE + other_popE) +
   ifelse(is.na(medhhinc_multiracialE), 0, medhhinc_multiracialE) * multiracial_popE) / total_popE)
# For median household income, I had to turn all median household incomes that were NA to 0, so that it would not mess up the formula. 
# Multiplying median household income times population by race. There was no islander median household income, so I included it in other. All divided by the total population, to get the total median household income. 
# Creating a summary table 
philly_acs_summary <- philly_tract_acs %>%
  select(
    GEOID, 
    NAME,
    pct_poverty,
    pct_bach,
    pct_vacant,
    pct_single_family,
    medhhinc
  )

# Get after acs dimension.
after_acs_dimension <- dim(philly_acs_summary)
# Join primary and census data.
final_data <- residential_points %>%
  left_join(philly_acs_summary, by = "GEOID") %>%
  select(-c(sale_date, year_built, ALAND, AWATER,
            INTPTLAT, INTPTLON, NAME.x, LISTNAME, NAME.y,
            Shape_Leng, Shape_Area)
         )
# Create key variables list.
key_columns <- c("sale_price", "ln_sale_price", "square_feet", "ln_square_feet",
                 "bath_num", "fireplace_num", "garage_num", "ac_binary",
                 "story_num", "age", "city_dist_mi", "ln_city_dist",
                 "septa_half_mi", "ln_septa_half_mi", "parks_mile", "ln_park_dist",
                 "knn_amenity_mi", "pct_poverty", "pct_bach",
                 "pct_vacant", "pct_single_family", "medhhinc",
                 "basement_type", "fuel_type", "neighborhood_fe", "quarters_fe")

# Reorder for key columns first and drop all rows with NA because OLS needs complete observations.
final_data <- final_data %>%
  select(any_of(key_columns), everything()) %>%
  na.omit(.)

# Get final dimension.
final_dimension <- dim(final_data)
# Separate before/after dimensions for data.
dimensions <- data.frame(
  rows_columns = c("Rows", "Columns"),
  "Property Data Before" = og_property_dimension,
  "Property Data After" = after_property_dimension,
  "Property Data After Feature Engineering" = after_feat_eng_dimension,
  "ACS Data Before" = og_acs_dimension,
  "ACS Data After" = after_acs_dimension,
  "Final Data" = final_dimension
)

# Make Kable of dimensions.
dimensions_kable <- kable(dimensions,
                          col.names = c("Dimensions", "Property Data Before", "Property Data After",
                                        "Property Data After Feature Engineering",
                                        "ACS Data Before", "ACS Data After", "Final Data"),
                          digits = 2,
                          caption = "Before and After Data Dimensions",
                          format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100", bold = TRUE)

dimensions_kable
Before and After Data Dimensions
Dimensions Property Data Before Property Data After Property Data After Feature Engineering ACS Data Before ACS Data After Final Data
Rows 560,961 25,029 13,799 28,261 3,446 13,798
Columns 79 17 34 4 7 29

Census:

Using tidycensus, we imported all variables that aligned with our structural data from 2023 ACS data by tracts: median household income, total population, poverty by ethnicity (White, Black, Native American, Asian, Pacific Islander, “Other,” Multiracial), males and females aged 18–65+ with bachelor’s degrees or higher, total vacancy, and total housing units by ethnicity, as well as single-family households and median household income per ethnic group. We compiled the individual poverty, bachelor’s degree, unit, and single-family household counts by ethnicity to form the following percentage variables: total_poverty, total_bach, total_units, and total_single_family.

From this, we calculated pct_poverty, pct_bach, pct_vacant (accounting for ACS errors), pct_single_family, and medhhinc, transforming NAs to 0 for regression analysis.

Using our residential property vector data (which includes structural, spatial, and feature-engineered variables), we performed a left join on the cleaned ACS summary data by GEOID.

After organizing the final dataset so key variables appear first, we generated a kable summarizing the workflow. The final dataset contains 26,344 observations and 29 columns.

Exploratory Data Analysis (EDA)

Visualizations

# Distribution of Sale Prices Histogram

sale_price_hist <- ggplot(final_data, aes(x = sale_price)) +
  geom_histogram(fill = "#6BAED6", color = "white", bins = 25) +
  labs(title = "Distribution of Sale Prices",
       x = "Sale Price ($)",
       y = "Count") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

sale_price_hist

Distribution of Sales Prices - Histogram

The distribution of sales prices is heavily skewed right, with a high percentage of thee transactions below $500,000, and a relatively small number of homes falling in the multi-million dollar range. This chart also also gives insight into the potential presence of high leverage outliers falling toward the left. A concentration of thousands of sales clustered on the lower end of the price ranges suggests strong segmentation, meaning that the market is split into distinct groups, where you see a large cluster of homes that are relatively similar and a smaller but very different cluster of high-value properties with luxury features and few homes in-between, the market looks like separate groups rather than one gradually increasing scale. It’s important to note that this spread likely indicates heteroskedasticity, and would require a log transformation for statistical inference.

# Geographic distribution map - Sale Price

tmap_mode("plot")

sale_price_map <- tm_shape(philly_neighborhoods) +
  tm_polygons(col = "gray95", border.col = "gray65") +
  tm_shape(final_data) +
  tm_dots(col = "sale_price",
          palette = "YlOrRd",
          size = 0.05,
          style = "quantile",
          col.title = "Sale Price ($)") +
  tm_layout(main.title = "Geographic Distribution of Sale Prices",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE,
            bg.color = "#f5f4f0")

sale_price_map

#tmap_save(tm = sale_price_map, filename = "slide_images/geo-price-map.png", width = 10, height = 6, units = "in", dpi = 300, bg = "transparent")

Geographic Distribution of Sales Prices - Map

This map displays a high variation of sales prices distributed throughout the city. The higher priced clustered areas are the Center City, University City, the riverfront, and affluent pockets of the Northwest, potentially because these areas provide easy access to transit, employment centers, and cultural amenities. The northern area stretching above Broad Street into parts of West and North Philadelphia displays lower-priced homes, potentially reflecting long-term disinvestment, high vacancy rates, and aging non-renovated housing stock. The spatial clustering suggests that sale price is place-dependent in Philadelphia, mostly due to neighborhood qualities, and fixed effects.

# Sale Price vs structural features scatter plot

# Sale Price vs. Square feet 
price_v_sqft_plot <- ggplot(final_data, aes(x = square_feet, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#08519C", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Square Feet",
       x = "Square Feet",
       y = "Sale Price ($)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.background = element_rect(fill = "#f5f4f0"),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_sqft_plot

#ggsave("slide_images/price-v-sqft.png", plot = price_v_sqft_plot, width = 6, height = 4, units = "in", dpi = 300)

Sale Price vs. Square Feet

This scatter plot highlights the importance of home size as a structural indicator of value, even though the relationship may not be linear. There is a dense concentration of homes clustered below 3,000 sq ft. and under $500,000 consistent with the sales price distribution. Here we are seeing the same strong skew to the right, displaying a relationship between the two variables. The upward trend is fairly obvious, larger homes = an increase in price, however the spread gets wider as square footage increases, indicating that while square footage is positively associated with price, the weaker relationship among larger properties suggests that additional living space contributes less to value once the home reaches a certain size category.

# Sale Price vs. Number of Fireplaces 

price_v_fire_plot <- ggplot(final_data, aes(x = fireplace_num, y = sale_price)) +
  geom_jitter(alpha = 0.5, color = "#08519C", size = 1.3, width = 0.15, height = 0) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Number of Fireplaces",
       x = "Number of Fireplaces",
       y = "Sale Price ($)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.background = element_rect(fill = "#f5f4f0"),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_fire_plot

#ggsave("slide_images/price-v-fire.png", plot = price_v_fire_plot, width = 6, height = 4, units = "in", dpi = 300)

Sale Price vs. Number of Fireplaces

This chart displays the positive relationship between the value of a home in relation to the quality and character of aesthetic features. Nearly all homes with no fireplaces remain in the lower-mid price range, and once a home has two or more fire places the sale price increases to a much higher range. Homes in Philadelphia with several fireplaces are usually bigger, older, homes with higher-end finishes, meaning that fireplace count can serve as a secondary and indirect indicator of several other indicators that influence home value, and this is the reason for the high level of noise on the chart.

# Sale Price vs. Spatial Features

# Sale Price vs Distance to city center
price_v_spatial_plot <- ggplot(final_data, aes(x = city_dist_mi, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#41AB5D", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Distance to Downtown",
       x = "ln Distance to City Center, mi",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_spatial_plot

Sale Price vs. Distance to City Center

The plots shows a weak negative relationship between price and distance to the city center. High valued homes fall both close to downtown and well outside of it, highlighting the structure of Philadelphia’s housing market as it relates to location. It’s not a city where closer is always better, instead certain neighborhoods like chestnut hill maintain their high premiums due to neighborhood reputation, and school quality. This pattern suggests that there are many fixed affects at play here, and it is important to note that similar patterns were displayed among many spatial variables.

# Sale Price vs. Distance to Parks - Park Accessibility

price_v_park_plot <- ggplot(final_data, aes(x = parks_mile, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#238B45", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Distance to Nearest Park",
       x = "Distance to Nearest Park (mi)",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_park_plot

Sale Price vs. Distance to Parks - Park Accessibility

The plot suggests that distance to parks has very little trend with sale price. Many high-priced properties sit both very close and very far away from parks, suggesting that park accessibility alone may not hold much value. This could be correlated with the fact that some parks are major attractions; i.e Fairmount Park, while others have limited impact on neighborhood desirability, especially in areas of low investment. Because of the difference in park quality across the city, the distance metric may not represents the true relationship, and more neighborhood or amenity variables may be required to capture environmental quality more accurately.

# Median income vs Sale Price per neighborhood

inc_v_price_plot <- ggplot(final_data, aes(x = medhhinc, y = sale_price, color = neighborhood_fe)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Relationship Between Median Income and Sale Price by Neighborhood",
       x = "Median Household Income ($)",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_viridis_d(option = "turbo") +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10)),
    legend.position = "bottom",
    legend.text = element_text(size = 7),
    legend.key.size = unit(0.4, "cm"),
    legend.box = "horizontal"
    ) +
  guides(color = guide_legend(ncol = 8, byrow = TRUE))

inc_v_price_plot

Median Income vs Sale Price per neighborhood

This plot further advances the claim that even with a lot of scatter in the points, there is a noticeable upward trend of homes in higher-income neighborhoods selling for higher prices. Many of the neighborhoods cluster in specific ranges of the price scale, even when income levels are similar, suggesting that the housing market in Philadelphia is dependent not only on demographics or income, but place effects, where reputation or fixed effects in a neighborhood increase or reduce prices. An example of this is some neighborhoods with moderate household incomes still show clusters of high-value transactions, while others with similar incomes remain in the lower end of the market, again highlighting other factors and fixed effects like transit access, historical character, and school quality. The main takeaway is that while higher-income neighborhood tend to have homes with higher sale prices, neighborhood identity also plays a big role in sale price.

# Spatial Relationship Between sale price and structural predictors

tmap_mode("plot")

# Sale Price
map_value <- tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "sale_price",
          palette = "YlOrRd",
          style = "quantile",
          size = 0.04,
          title = "Sale Price") +
  tm_layout(main.title = "Distribution of Sale Prices",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)
# Bathrooms Predictor

map_bath <-  tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "bath_num",
          palette = "Blues",
          style = "quantile",
          size = 0.04,
          title = "Number of Bathrooms") +
  tm_layout(main.title = "Distribution of Bathrooms",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)
# Stories Predictor

map_story <-  tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "story_num",
          palette = "Greens",
          style = "quantile",
          size = 0.04,
          title = "Number of Stories") +
  tm_layout(main.title = "Distribution of Stories",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)
# Display them side by side

tmap_arrange(map_value, map_bath, map_story, ncol = 3)

map_struct <- tmap_arrange(map_value, map_bath, map_story, ncol = 3)

Spatial Relationship Between sale price and structural predictors (bathrooms and stories) - Map

This figure shows how housing characteristics are clustered across Philadelphia. Here we can see the sale price distribution directly against certain structural features. Bathrooms and number of stories follow similar geographic patterns, as sale price, areas with higher sales prices tend to have bigger homes with more bathrooms and more stories. In contrast, neighborhoods where homes have less structural amenities also have subsequently lower sale prices. This group of maps makes a strong visual argument that structural features and neighborhood context evolve together, supporting the modeling strategy of incorporating more structural than spatial predictors.

# Sale price histogram.
price_hist <- ggplot(residential_points, aes(x = sale_price)) +
  geom_histogram(fill = "pink", color = "white") +
  labs(title = "Histogram of Sale Price", x = "Sale Price ($)", y = "Count") +
  theme_minimal()

# Log sale price histogram.
ln_price_hist <- ggplot(residential_points, aes(x = ln_sale_price)) +
  geom_histogram(fill = "pink", color = "white") +
  labs(title = "Histogram of ln(Sale Price)", x = "ln(Sale Price)", y = "Count") +
  theme_minimal()

grid.arrange(price_hist, ln_price_hist, ncol = 2)

The raw distribution of sale prices is right-skewed with a median of 250,000 USD and a mean of 343,867 USD. There is a substantial gap in price between the third quartile (375,000 USD) and the maximum price (15,428,633 USD). While 75% of houses sold for 375,000 USD or less, the upper 25% exhibits considerable variability, with prices ranging up to 15.4 million USD, affecting the tail distribution. This suggests that a small number of luxury properties are affecting the distribution of the sales price data To address this skewness and improve model performance, we performed a log transformation, making our data closer to normal by compressing the scale of higher values, emphasizing a standardized change in percentage over dollar amount.

# Livable space histogram.
livable_area_hist <- ggplot(residential_points, aes(x = square_feet)) +
  geom_histogram(fill = "wheat", color = "white") +
  labs(title = "Histogram of Square Feet", x = "Square Feet", y = "Count") +
  theme_minimal()

# Log livable space histogram.
ln_square_feet_hist <- ggplot(residential_points, aes(x = ln_square_feet)) +
  geom_histogram(fill = "wheat", color = "white") +
  labs(title = "Histogram of ln(Square Feet)", x = "ln(Square Feet)", y = "Count") +
  theme_minimal()

grid.arrange(livable_area_hist, ln_square_feet_hist, ncol = 2)

The distribution of livable area is right-skewed, with a median of 1,216 sq ft and a mean of 1,372 sq ft. While 75% of homes are under 1,509 sq ft, a small number of larger properties extend the upper tail of the distribution. Notably, homes above 3,000 sq ft become increasingly sparse, suggesting a separation between standard and luxury housing markets. We applied a log transformation to this variable to create a more symmetric distribution by compressing the scale of larger homes, which improves linearity in our model and allows coefficients to represent proportional rather than absolute changes in square footage.

# Distance to downtown histogram.
downtown_dist_hist <- ggplot(residential_points, aes(x = city_dist_mi)) +
  geom_histogram(fill = "darkblue", color = "white") +
  labs(title = "Histogram of Downtown Distance", x = "Downtown Distance (mi)", y = "Count") +
  theme_minimal()

# Log distance to downtown histogram.
ln_downtown_dist_hist <- ggplot(residential_points, aes(x = ln_city_dist)) +
  geom_histogram(fill = "darkblue", color = "white") +
  labs(title = "Histogram of ln(Downtown Distance)", x = "ln(Downtown Distance)", y = "Count") +
  theme_minimal()

grid.arrange(downtown_dist_hist, ln_downtown_dist_hist, ncol = 2)

The distribution of distance to Center City (City Hall) is right-skewed, with fewer observations of houses occurring as the distance from Center City/City Hall increases. The effect of distance on housing prices is non-linear: being 1 mile from Center City has a larger impact on price than being 6 vs. 11 miles out. To account for this, we applied a log transformation, which compresses the upper tail, creates a more symmetric distribution, and reduces the influence of extreme distances. This transformation improves linearity in our regression model and allows coefficients to be interpreted as proportional changes in price per proportional change in distance.

# SEPTA buffer histogram.
septa_buffer_hist <- ggplot(residential_points, aes(x = septa_half_mi)) +
  geom_histogram(fill = "azure3", color = "white") +
  labs(title = "Histogram of SEPTA 0.5mi Buffer", x = "SEPTA 0.5mi Buffer (mi)", y = "Count") +
  theme_minimal()

# Log SEPTA buffer histogram.
ln_septa_buffer_hist <- ggplot(residential_points, aes(x = ln_septa_half_mi)) +
  geom_histogram(fill = "azure3", color = "white") +
  labs(title = "Histogram of ln(SEPTA 0.5mi Buffer)", x = "ln(SEPTA 0.5mi Buffer)", y = "Count") +
  theme_minimal()

grid.arrange(septa_buffer_hist, ln_septa_buffer_hist, ncol = 2)

The distribution of SEPTA access within a 0.5-mile buffer of each property is right-skewed, with a median of 44 and a mean of 52.5. While most properties have between 29 and 69 nearby SEPTA access points, there is substantial variation ranging from zero (likely remote suburban properties) to over 160 in the most transit-dense neighborhoods. The log transformation compresses this wide range and produces a more symmetric distribution, which is appropriate given that transit accessibility exhibits diminishing returns—the marginal benefit of additional access decreases as the total number increases.

# kNN amenities histogram.
knn_amenities_hist <- ggplot(residential_points, aes(x = knn_amenity_mi)) +
  geom_histogram(fill = "darkseagreen", color = "white") +
  labs(title = "Histogram of kNN Amenities", x = "kNN Amenities (mi)", y = "Count") +
  theme_minimal()

grid.arrange(knn_amenities_hist)

For the kNN Amenities variable, the mean distance to the nearest amenity per household is 0.31 miles, and the median is 0.27 miles. Seventy-five percent of households are within 0.42 miles of any of the 23 amenities described in the data cleaning section. These statistics showcase Philadelphia’s reputation as a highly walkable city. Observations beyond 1 mile typically reflect suburban or rural settings. We did not transform this variable, instead using the raw distances to preserve the direct relationship between amenity proximity and property values.The kNN approach inherently reflects local amenity density, with shorter distances in denser areas and longer distances where households are more dispersed.

# Distance to park histogram..
parks_dist <- ggplot(residential_points, aes(x = parks_mile)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of Parks Distance", x = "Parks Distance (mi)", y = "Count") +
  theme_minimal()

# Log distance to park histogram.
ln_parks_dist <- ggplot(residential_points, aes(x = ln_park_dist)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of ln(Parks Distance)", x = "Parks Distance", y = "Count") +
  theme_minimal()

grid.arrange(parks_dist, ln_parks_dist, ncol = 2)

The distribution of the variable Distance from Parks by miles is showing a slight right skew. Between the third quartile and the maximum there is a jump of about 3 miles, indicating outliers beyond 2 miles. Due to this, we chose to log the variable to ensure it is normal for our model.

# Number of bathrooms histogram.
ggplot(residential_points, aes(x = bath_num)) +
  geom_histogram(fill = "gold", color = "white") +
  labs(title = "Histogram of Bathrooms", x = "Bathrooms", y = "Count") +
  theme_minimal()

The histogram showcases the number of observations of properties with n bathrooms. Most observations exhibit 1 or 2 bathrooms per property, with a an outlier of 8 bathrooms.

# Number of fireplaces histogram.
ggplot(residential_points, aes(x = fireplace_num)) +
  geom_histogram(fill = "darkred", color = "white") +
  labs(title = "Histogram of Fireplaces", x = "Fireplaces", y = "Count") +
  theme_minimal()

This histogram showcases the number of observations with n fireplaces. Most observations contain 0 fireplaces, with outliers of property(ies) containing 2 or more.

# Number of garages histogram.
ggplot(residential_points, aes(x = garage_num)) +
  geom_histogram(fill = "gray", color = "white") +
  labs(title = "Histogram of Garages", x = "Garages", y = "Count") +
  theme_minimal()

This histogram showcases the number of garages available per observation. The median is 0 garages per property, with a maximum of 5 garages per property.

# Number of stories histogram.
ggplot(residential_points, aes(x = story_num)) +
  geom_histogram(fill = "orange", color = "white") +
  labs(title = "Histogram of Stories", x = "Stories", y = "Count") +
  theme_minimal()

This histogram showcases the number of observations that contain n housing stories. The median is 2 stories per property, with a maximum of 5 stories.

# Age histogram.
ggplot(residential_points, aes(x = age)) +
  geom_histogram(fill = "black", color = "white") +
  labs(title = "Histogram of Age", x = "Age", y = "Count") +
  theme_minimal()

This histogram showcases the amount of observations with their calculated ages (years) based off year built. The median age is 100 years, and the maximum is 275 years. This was then transformed into a polynomial variable to account for the decrease in housing price as the home ages until a certain age, then the price rises again.

# Histogram for pct_poverty
ggplot(philly_acs_summary, aes(x = pct_poverty)) +
  geom_histogram(fill = "skyblue", color = "white") +
  labs(title = "Histogram of Percent Poverty", x = "Percent Poverty (%)", y = "Count") +
  theme_minimal()

This histogram presents the tracts with their determined percent poverty. The percentage is suspiciously high with the range from the minimum (0%) from the first quartile is 98 percent.We also have 32 tracts with no data.

# Histogram for pct_bach
ggplot(philly_acs_summary, aes(x = pct_bach)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "white") +
  labs(title = "Histogram of Percent Bachelor Degree Holders", x = "Percent Bachelor Degree Holders (%)", y = "Count") +
  theme_minimal()

This histogram shows the distribution of percent of bachelor + degree holders per census tract. The median is 13.5 %, with most tracts falling between the minimum and third quartile. We also have 32 tracts with no data.

# Histogram for pct_vacant
ggplot(philly_acs_summary, aes(x = pct_vacant)) +
  geom_histogram(binwidth = 10, fill = "salmon", color = "white") +
  labs(title = "Histogram of Percent Vacant Units", x = "Percent Vacant (%)", y = "Count") +
  theme_minimal()

This histogram shows the distribution of percent vacant residences within the census tracts. The median percentage of vacant homes is 8 percent. We also have 44 tracts with no data.

# Histogram for pct_single_family
ggplot(philly_acs_summary, aes(x = pct_single_family)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "white") +
  labs(title = "Histogram of Percent Single-Family Homes ", x = "Percent Single-Family (%)", y = "Count") +
  theme_minimal()

This histogram represents the percent of single families per tract. The median for percent single families per tract is 66.58%. We also have 45 tracts with no data.

# Histogram for medhhinc
ggplot(philly_acs_summary, aes(x = medhhinc)) +
  geom_histogram(binwidth = 10000, fill = "purple", color = "white") +
  labs(title = "Histogram of Median Household Income", x = "Median Household Income ($)", y = "Count") +
  theme_minimal()

This histogram represents the median household income value per census tract. The median of the median household income value is $66,795. Slightly right-skewed. We also have 32 tracts with no data.

# Residential property correlation.
corr_matrix_df <- final_data %>%
  select(1:22) %>%
  st_drop_geometry()

corr_matrix <- round(cor(na.omit(corr_matrix_df)), 2)

corr_matrix_plot <- ggcorrplot(corr_matrix,
                               title = "Correlation Matrix",
                               hc.order = FALSE,
                               method = "square",
                               lab = TRUE,
                               lab_size = 3,
                               colors = c("#6D9EC1", "white", "#E46726"),
                               ggtheme = ggplot2::theme_gray,
                               insig = "blank"
                               )

corr_matrix_plot

#ggsave("correlation_matrix.png", width = 8, height = 8, dpi = 400)

5 strongest correlations to price:

  • square feet

  • bathrooms

  • median income

  • percent bachelor’s

  • fireplaces

Seems like bachelor degrees are highly correlated with median income, and should be excluded as a predictor.

Model Building

Model Building Progression

Check for multicollinearity:

# VIF check for multicollinearity.
vif_check <- lm(ln_sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + story_num  + garage_num + ln_septa_half_mi + ln_park_dist + ln_city_dist + basement_type + fuel_type, data = residential_points)

vif(vif_check)
                     GVIF Df GVIF^(1/(2*Df))
ln_square_feet   2.142390  1        1.463690
bath_num         1.899785  1        1.378327
ac_binary        1.322354  1        1.149937
fireplace_num    1.255618  1        1.120544
story_num        1.484098  1        1.218236
garage_num       2.279258  1        1.509721
ln_septa_half_mi 3.029797  1        1.740631
ln_park_dist     1.042660  1        1.021107
ln_city_dist     3.377854  1        1.837894
basement_type    3.093711 10        1.058093
fuel_type        1.049525  3        1.008089
# Build Model step by step
# First Model (structural features only)

first_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data)

summary(first_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2), data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-948778  -77746  -12856   55399 4965521 

Coefficients:
                                          Estimate    Std. Error t value
(Intercept)                          -1597177.8938    47120.6098 -33.896
ln_square_feet                         269187.6859     6770.2012  39.761
bath_num                                50745.3700     2844.2275  17.842
ac_binary                               92075.8060     3512.9628  26.210
fireplace_num                          140541.8108     4554.0080  30.861
story_num                               35873.2911     3358.8392  10.680
garage_num                              73028.0005     4134.1809  17.664
basement_typeFull Finished             -56419.1300    10192.9544  -5.535
basement_typeFull Semi-Finished        -68960.0361    11754.8714  -5.867
basement_typeFull Unfinished           -71796.8037    10601.7781  -6.772
basement_typeFull Unknown Finish       -85567.6562    10969.5130  -7.800
basement_typePartial Finished         -123335.5158    10805.7207 -11.414
basement_typePartial Semi-Finished    -133994.8197    11268.2740 -11.891
basement_typePartial Unfinished       -129057.8554    13661.8798  -9.447
basement_typePartial Unknown Finish   -137683.1021    13300.7916 -10.351
basement_typeUnknown Size Finished      58223.7228    49739.4358   1.171
basement_typeUnknown Size Unfinished   -25195.3327    38771.7977  -0.650
fuel_typeElectric                       -9164.7673    10287.6921  -0.891
fuel_typeOil Heat                       -2182.1398    20595.6581  -0.106
fuel_typeOther                         257417.4414    62335.0474   4.130
age                                     -4476.7210      145.6790 -30.730
I(age^2)                                   26.3560        0.8026  32.838
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                            < 0.0000000000000002 ***
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished            0.00000003167146392 ***
basement_typeFull Semi-Finished       0.00000000455286408 ***
basement_typeFull Unfinished          0.00000000001320276 ***
basement_typeFull Unknown Finish      0.00000000000000661 ***
basement_typePartial Finished        < 0.0000000000000002 ***
basement_typePartial Semi-Finished   < 0.0000000000000002 ***
basement_typePartial Unfinished      < 0.0000000000000002 ***
basement_typePartial Unknown Finish  < 0.0000000000000002 ***
basement_typeUnknown Size Finished                  0.242    
basement_typeUnknown Size Unfinished                0.516    
fuel_typeElectric                                   0.373    
fuel_typeOil Heat                                   0.916    
fuel_typeOther                        0.00003655727865199 ***
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 175500 on 13776 degrees of freedom
Multiple R-squared:  0.5934,    Adjusted R-squared:  0.5927 
F-statistic: 957.2 on 21 and 13776 DF,  p-value: < 0.00000000000000022
# Build Model step by step
# Second Model (structural features + census features)

second_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # Census feature.
                    data = final_data)

summary(second_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family, 
    data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-988908  -66057   -4762   50765 5084141 

Coefficients:
                                           Estimate     Std. Error t value
(Intercept)                          -1652366.06614    43488.59929 -37.995
ln_square_feet                         252414.21462     6292.52465  40.113
bath_num                                56908.37082     2623.69081  21.690
ac_binary                               51554.86699     3351.19401  15.384
fireplace_num                          124884.39498     4234.54173  29.492
story_num                                8377.38556     3183.80795   2.631
garage_num                              78089.54533     3851.12691  20.277
basement_typeFull Finished             -14581.14843     9416.23875  -1.549
basement_typeFull Semi-Finished        -31903.14430    10846.75153  -2.941
basement_typeFull Unfinished           -29870.26998     9789.07396  -3.051
basement_typeFull Unknown Finish       -37259.53398    10137.55754  -3.675
basement_typePartial Finished          -85017.93568     9990.78528  -8.510
basement_typePartial Semi-Finished     -83737.55696    10442.37511  -8.019
basement_typePartial Unfinished        -77873.36014    12623.54906  -6.169
basement_typePartial Unknown Finish    -90447.97887    12288.99493  -7.360
basement_typeUnknown Size Finished      93761.87823    45773.56575   2.048
basement_typeUnknown Size Unfinished    17834.06219    35694.03271   0.500
fuel_typeElectric                        4102.48429     9491.30032   0.432
fuel_typeOil Heat                       -9950.88093    18954.82863  -0.525
fuel_typeOther                         146252.37514    57412.98694   2.547
age                                     -3218.69342      136.46375 -23.586
I(age^2)                                   19.95215        0.75287  26.501
medhhinc                                    2.61912        0.05504  47.587
pct_vacant                                495.26412      220.95040   2.242
pct_single_family                       -1819.23898      167.15962 -10.883
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                                        0.008517 ** 
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished                       0.121522    
basement_typeFull Semi-Finished                  0.003274 ** 
basement_typeFull Unfinished                     0.002282 ** 
basement_typeFull Unknown Finish                 0.000238 ***
basement_typePartial Finished        < 0.0000000000000002 ***
basement_typePartial Semi-Finished    0.00000000000000115 ***
basement_typePartial Unfinished       0.00000000070693158 ***
basement_typePartial Unknown Finish   0.00000000000019421 ***
basement_typeUnknown Size Finished               0.040541 *  
basement_typeUnknown Size Unfinished             0.617339    
fuel_typeElectric                                0.665576    
fuel_typeOil Heat                                0.599606    
fuel_typeOther                                   0.010864 *  
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
medhhinc                             < 0.0000000000000002 ***
pct_vacant                                       0.025008 *  
pct_single_family                    < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 161500 on 13773 degrees of freedom
Multiple R-squared:  0.656, Adjusted R-squared:  0.6554 
F-statistic:  1094 on 24 and 13773 DF,  p-value: < 0.00000000000000022
# Build Model step by step
# Third Model (structural features + census features + spatial features)

third_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family + # Census feature.
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial 
                    data = final_data)

summary(third_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family + 
    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, 
    data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-819360  -62929   -2749   50992 5092336 

Coefficients:
                                           Estimate     Std. Error t value
(Intercept)                          -1774925.12287    42233.04610 -42.027
ln_square_feet                         262440.17744     6035.63376  43.482
bath_num                                50664.20185     2522.45464  20.085
ac_binary                               45822.40283     3222.18735  14.221
fireplace_num                          129025.93647     4052.75284  31.837
story_num                               -3036.05627     3077.33629  -0.987
garage_num                              87437.98894     3695.25487  23.662
basement_typeFull Finished               5841.80894     9044.99353   0.646
basement_typeFull Semi-Finished         -4238.73810    10414.71253  -0.407
basement_typeFull Unfinished            -9483.81135     9410.95820  -1.008
basement_typeFull Unknown Finish       -16478.80828     9740.96075  -1.692
basement_typePartial Finished          -41443.86599     9649.71826  -4.295
basement_typePartial Semi-Finished     -38333.92770    10125.94885  -3.786
basement_typePartial Unfinished        -47590.95708    12107.05488  -3.931
basement_typePartial Unknown Finish    -57413.84143    11790.72725  -4.869
basement_typeUnknown Size Finished     119381.17238    43777.64920   2.727
basement_typeUnknown Size Unfinished    36746.39206    34137.09291   1.076
fuel_typeElectric                        3424.86104     9076.76134   0.377
fuel_typeOil Heat                       -2331.50864    18128.78591  -0.129
fuel_typeOther                         101793.28016    54918.28726   1.854
age                                     -2569.92375      131.88651 -19.486
I(age^2)                                   14.64481        0.73570  19.906
medhhinc                                    2.08753        0.05599  37.284
pct_vacant                              -1078.00250      225.51363  -4.780
pct_single_family                         717.73161      182.30169   3.937
city_dist_mi                            -2548.89825      765.01159  -3.332
septa_half_mi                            1838.15719       70.12560  26.212
ln_park_dist                           -16813.99559     2088.45496  -8.051
knn_amenity_mi                         -10877.25171     8548.59284  -1.272
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                                        0.323863    
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished                       0.518380    
basement_typeFull Semi-Finished                  0.684018    
basement_typeFull Unfinished                     0.313596    
basement_typeFull Unknown Finish                 0.090725 .  
basement_typePartial Finished        0.000017602797039347 ***
basement_typePartial Semi-Finished               0.000154 ***
basement_typePartial Unfinished      0.000085061952320613 ***
basement_typePartial Unknown Finish  0.000001131770004475 ***
basement_typeUnknown Size Finished               0.006400 ** 
basement_typeUnknown Size Unfinished             0.281751    
fuel_typeElectric                                0.705940    
fuel_typeOil Heat                                0.897670    
fuel_typeOther                                   0.063826 .  
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
medhhinc                             < 0.0000000000000002 ***
pct_vacant                           0.000001769226310464 ***
pct_single_family                    0.000082894057054214 ***
city_dist_mi                                     0.000865 ***
septa_half_mi                        < 0.0000000000000002 ***
ln_park_dist                         0.000000000000000889 ***
knn_amenity_mi                                   0.203252    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 154400 on 13769 degrees of freedom
Multiple R-squared:  0.6856,    Adjusted R-squared:  0.685 
F-statistic:  1072 on 28 and 13769 DF,  p-value: < 0.00000000000000022
# Build Model step by step
# Final Model 
## (structural features + census features + spatial features + interactions and fixed effects)

final_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family + # Census feature.
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial 
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe + quarters_fe, # Fixed effect 
                    data = final_data)

summary(final_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family + 
    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + 
    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + 
    neighborhood_fe + quarters_fe, data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-800100  -47565    1786   43876 5123415 

Coefficients:
                                                 Estimate     Std. Error
(Intercept)                                -1677905.26604    45438.44126
ln_square_feet                               265516.73413     5665.17910
bath_num                                      43679.15286     2304.27405
ac_binary                                     38243.85796     3013.01343
fireplace_num                                 89416.98302     3844.08168
story_num                                     -4012.49519     2930.35076
garage_num                                    76977.50398     3398.82160
basement_typeFull Finished                    98848.82084     8801.51291
basement_typeFull Semi-Finished               89048.26055     9947.99990
basement_typeFull Unfinished                  78539.61914     9106.34836
basement_typeFull Unknown Finish              70893.15476     9360.82346
basement_typePartial Finished                 46110.11085     9306.38719
basement_typePartial Semi-Finished            42411.40202     9809.19885
basement_typePartial Unfinished               34988.53944    11384.80073
basement_typePartial Unknown Finish           27282.08741    11123.37542
basement_typeUnknown Size Finished           163213.61867    39526.77205
basement_typeUnknown Size Unfinished         129158.32386    30871.19004
fuel_typeElectric                            -10432.86061     8222.22764
fuel_typeOil Heat                              2814.52215    16253.04735
fuel_typeOther                                 4786.99869    49366.45256
age                                           -1443.27298      129.58915
I(age^2)                                          5.65537        0.74073
medhhinc                                          0.60278        0.08432
pct_vacant                                     -958.86594      293.08784
pct_single_family                               208.48015      214.04636
city_dist_mi                                  -4831.88421     3209.73761
septa_half_mi                                  1374.82535      193.27127
ln_park_dist                                  -6173.96721     4828.54465
knn_amenity_mi                               -19099.00935    24097.41716
neighborhood_feAcademy Gardens                32027.47498    28958.45924
neighborhood_feAllegheny West                -82852.95327    19864.95272
neighborhood_feAndorra                       -41620.14508    30515.82310
neighborhood_feAston-Woodbridge               30554.10136    33071.19283
neighborhood_feBella Vista                    90398.90913    24431.54644
neighborhood_feBelmont                       -79559.84808    41026.83191
neighborhood_feBrewerytown                   -75177.75115    18778.66325
neighborhood_feBridesburg                      2550.47870    22106.85981
neighborhood_feBurholme                      -16608.94664    36000.13702
neighborhood_feBustleton                      22634.66033    24578.34911
neighborhood_feCarroll Park                  -83609.04139    20510.34719
neighborhood_feCedar Park                     40384.16990    21412.20070
neighborhood_feCedarbrook                    -24084.53071    23444.18363
neighborhood_feChestnut Hill                 398379.01091    20914.15335
neighborhood_feClearview                     -89401.47400    31878.53628
neighborhood_feCobbs Creek                   -84130.97360    16115.13817
neighborhood_feDickinson Narrows             -47829.28868    20588.91637
neighborhood_feDunlap                       -155437.63049    34707.96052
neighborhood_feEast Germantown               -58167.55845    22366.28386
neighborhood_feEast Kensington               -27422.36849    18544.82729
neighborhood_feEast Mount Airy               -46255.97004    18245.76517
neighborhood_feEast Oak Lane                -117286.06804    29080.08570
neighborhood_feEast Parkside                -103891.19354    42835.19478
neighborhood_feEast Passyunk                  21118.42456    19723.91233
neighborhood_feEastwick                      -70465.06053    34283.91465
neighborhood_feElmwood                       -70679.74484    18135.97829
neighborhood_feFairhill                      -75595.59992    39476.09467
neighborhood_feFairmount                      78693.54567    18059.71450
neighborhood_feFeltonville                   -66975.84665    19927.49294
neighborhood_feFern Rock                     -85933.82948    34257.10976
neighborhood_feFishtown - Lower Kensington     8972.52741    15418.08785
neighborhood_feFitler Square                 432714.24530    30533.60901
neighborhood_feFox Chase                      -6599.14402    21367.54777
neighborhood_feFrancisville                  -65443.08187    23431.28042
neighborhood_feFrankford                     -67190.77704    18643.04056
neighborhood_feFranklinville                -133261.21143    30027.28329
neighborhood_feGermantown - Morton           -73599.31154    26838.56583
neighborhood_feGermantown - Penn Knox       -144873.12243    44167.07966
neighborhood_feGermantown - Westside        -104367.91668    39166.79656
neighborhood_feGermany Hill                    5748.73880    26011.63938
neighborhood_feGirard Estates                -33080.07340    18859.81111
neighborhood_feGlenwood                     -121793.12479    28779.82094
neighborhood_feGraduate Hospital              95016.44133    19878.73737
neighborhood_feGrays Ferry                   -78925.37179    17380.75802
neighborhood_feGreenwich                     -66132.85433    25126.75702
neighborhood_feHaddington                    -91375.60002    18470.45566
neighborhood_feHarrowgate                    -60329.29284    19332.44118
neighborhood_feHartranft                    -134233.37655    23111.29428
neighborhood_feHawthorne                      13777.54929    28718.24043
neighborhood_feHolmesburg                    -22464.41705    20489.40271
neighborhood_feHunting Park                  -58262.37190    21986.15047
neighborhood_feJuniata Park                  -53922.36623    17783.21002
neighborhood_feKingsessing                  -104337.27552    17205.84308
neighborhood_feLawndale                      -44276.58755    17622.72019
neighborhood_feLexington Park                 17662.83984    28421.36503
neighborhood_feLogan                        -102229.22119    19913.69036
neighborhood_feLogan Square                  378209.87391    30046.74975
neighborhood_feLower Moyamensing             -67527.24596    17426.16791
neighborhood_feManayunk                       23370.38090    18136.13033
neighborhood_feMantua                        -94254.73550    27578.70622
neighborhood_feMayfair                       -16897.23671    18275.78865
neighborhood_feMelrose Park Gardens          -72729.44753    31890.12382
neighborhood_feMill Creek                   -111688.97732    26777.85134
neighborhood_feMillbrook                      14797.85844    30473.87809
neighborhood_feModena                         50735.89383    28427.48412
neighborhood_feMorrell Park                   19689.03919    27838.06634
neighborhood_feNewbold                       -56497.03852    21721.80232
neighborhood_feNicetown                      -67648.77791    37927.30068
neighborhood_feNormandy Village              146685.04769    48931.47795
neighborhood_feNorth Central                -106823.37807    22102.30044
neighborhood_feNorthern Liberties               481.92747    19174.20112
neighborhood_feNorthwood                    -103432.66823    23262.07072
neighborhood_feOgontz                        -42751.64985    20101.09358
neighborhood_feOld City                      441414.04297    45005.78428
neighborhood_feOld Kensington                -49329.35873    20271.70370
neighborhood_feOlney                         -73499.11703    17099.89634
neighborhood_feOverbrook                     -96635.50354    16957.29522
neighborhood_feOxford Circle                 -28220.70851    17764.94995
neighborhood_fePacker Park                     3642.13864    25590.75306
neighborhood_feParkwood Manor                 46455.91524    29433.57194
neighborhood_fePaschall                      -74689.10344    20081.59211
neighborhood_fePassyunk Square                 6201.59333    20552.29599
neighborhood_fePennsport                     -34409.06143    19586.56594
neighborhood_fePennypack                      -4523.71040    23823.41102
neighborhood_fePennypack Woods                 5229.28684    46532.06057
neighborhood_fePenrose                       -63337.08704    30678.49027
neighborhood_fePoint Breeze                  -65532.16390    17649.26141
neighborhood_feQueen Village                 108123.20101    21143.03384
neighborhood_feRhawnhurst                      7329.63996    21673.41480
neighborhood_feRichmond                      -31347.21320    15323.36272
neighborhood_feRittenhouse                   433418.89116    26235.66773
neighborhood_feRoxborough                      5858.89223    16074.66479
neighborhood_feRoxborough Park               -38559.05901    32678.85304
neighborhood_feSharswood                    -100231.24410    41101.18894
neighborhood_feSociety Hill                  350904.55769    26018.85748
neighborhood_feSomerton                       51813.99347    28987.71301
neighborhood_feSouthwest Germantown         -105499.69999    24781.96856
neighborhood_feSouthwest Schuylkill          -93075.97758    21056.77489
neighborhood_feSpring Garden                 176516.97743    25749.53782
neighborhood_feSpruce Hill                   128911.04123    29940.21649
neighborhood_feStadium District              -26591.69497    21895.37157
neighborhood_feStanton                      -141699.67184    21193.05351
neighborhood_feStrawberry Mansion           -104297.41270    20680.34316
neighborhood_feSummerdale                    -45226.77299    21606.36386
neighborhood_feTacony                        -43593.37227    19806.02785
neighborhood_feTioga                        -117442.73710    23684.16528
neighborhood_feTorresdale                    -29218.02295    26982.11001
neighborhood_feUpper Kensington              -91779.96891    18863.83228
neighborhood_feUpper Roxborough              -36151.81710    20210.02565
neighborhood_feWalnut Hill                   -63380.83662    30081.83436
neighborhood_feWashington Square West         90303.07716    29806.56362
neighborhood_feWest Central Germantown         8513.85073    27775.82087
neighborhood_feWest Kensington               -71823.95801    20237.95153
neighborhood_feWest Mount Airy                30961.36353    18359.17825
neighborhood_feWest Oak Lane                 -50873.53045    18202.47073
neighborhood_feWest Passyunk                 -80608.74626    19727.58538
neighborhood_feWest Poplar                  -113771.48430    43372.29599
neighborhood_feWest Powelton                -102313.69360    33591.55981
neighborhood_feWhitman                       -45632.07608    18197.33618
neighborhood_feWinchester Park                33485.37086    34580.48460
neighborhood_feWissahickon                    -1347.59066    20189.31256
neighborhood_feWissahickon Hills               8473.23320    32242.03432
neighborhood_feWissinoming                   -26725.21777    18295.40074
neighborhood_feWister                        -75747.05344    30877.69005
neighborhood_feWynnefield                   -105331.76285    20055.32988
neighborhood_feSmall Neighborhoods           -31451.72693    18977.39070
quarters_fe2                                   3512.10179     3374.12787
quarters_fe3                                   4744.67437     3433.21972
quarters_fe4                                   5975.11825     3561.86480
city_dist_mi:knn_amenity_mi                    1581.49465     2645.05950
city_dist_mi:septa_half_mi                     -325.43517       41.26321
                                           t value             Pr(>|t|)    
(Intercept)                                -36.927 < 0.0000000000000002 ***
ln_square_feet                              46.868 < 0.0000000000000002 ***
bath_num                                    18.956 < 0.0000000000000002 ***
ac_binary                                   12.693 < 0.0000000000000002 ***
fireplace_num                               23.261 < 0.0000000000000002 ***
story_num                                   -1.369             0.170932    
garage_num                                  22.648 < 0.0000000000000002 ***
basement_typeFull Finished                  11.231 < 0.0000000000000002 ***
basement_typeFull Semi-Finished              8.951 < 0.0000000000000002 ***
basement_typeFull Unfinished                 8.625 < 0.0000000000000002 ***
basement_typeFull Unknown Finish             7.573  0.00000000000003870 ***
basement_typePartial Finished                4.955  0.00000073320552996 ***
basement_typePartial Semi-Finished           4.324  0.00001545679760218 ***
basement_typePartial Unfinished              3.073             0.002121 ** 
basement_typePartial Unknown Finish          2.453             0.014192 *  
basement_typeUnknown Size Finished           4.129  0.00003662085316710 ***
basement_typeUnknown Size Unfinished         4.184  0.00002884940316156 ***
fuel_typeElectric                           -1.269             0.204512    
fuel_typeOil Heat                            0.173             0.862521    
fuel_typeOther                               0.097             0.922753    
age                                        -11.137 < 0.0000000000000002 ***
I(age^2)                                     7.635  0.00000000000002410 ***
medhhinc                                     7.149  0.00000000000092079 ***
pct_vacant                                  -3.272             0.001072 ** 
pct_single_family                            0.974             0.330076    
city_dist_mi                                -1.505             0.132249    
septa_half_mi                                7.113  0.00000000000118821 ***
ln_park_dist                                -1.279             0.201046    
knn_amenity_mi                              -0.793             0.428039    
neighborhood_feAcademy Gardens               1.106             0.268755    
neighborhood_feAllegheny West               -4.171  0.00003053958885205 ***
neighborhood_feAndorra                      -1.364             0.172626    
neighborhood_feAston-Woodbridge              0.924             0.355561    
neighborhood_feBella Vista                   3.700             0.000216 ***
neighborhood_feBelmont                      -1.939             0.052496 .  
neighborhood_feBrewerytown                  -4.003  0.00006277974208792 ***
neighborhood_feBridesburg                    0.115             0.908153    
neighborhood_feBurholme                     -0.461             0.644549    
neighborhood_feBustleton                     0.921             0.357109    
neighborhood_feCarroll Park                 -4.076  0.00004599128356312 ***
neighborhood_feCedar Park                    1.886             0.059311 .  
neighborhood_feCedarbrook                   -1.027             0.304291    
neighborhood_feChestnut Hill                19.048 < 0.0000000000000002 ***
neighborhood_feClearview                    -2.804             0.005048 ** 
neighborhood_feCobbs Creek                  -5.221  0.00000018094702145 ***
neighborhood_feDickinson Narrows            -2.323             0.020191 *  
neighborhood_feDunlap                       -4.478  0.00000757999945784 ***
neighborhood_feEast Germantown              -2.601             0.009314 ** 
neighborhood_feEast Kensington              -1.479             0.139242    
neighborhood_feEast Mount Airy              -2.535             0.011251 *  
neighborhood_feEast Oak Lane                -4.033  0.00005531987107097 ***
neighborhood_feEast Parkside                -2.425             0.015306 *  
neighborhood_feEast Passyunk                 1.071             0.284323    
neighborhood_feEastwick                     -2.055             0.039865 *  
neighborhood_feElmwood                      -3.897  0.00009777188054228 ***
neighborhood_feFairhill                     -1.915             0.055517 .  
neighborhood_feFairmount                     4.357  0.00001325740682429 ***
neighborhood_feFeltonville                  -3.361             0.000779 ***
neighborhood_feFern Rock                    -2.508             0.012136 *  
neighborhood_feFishtown - Lower Kensington   0.582             0.560611    
neighborhood_feFitler Square                14.172 < 0.0000000000000002 ***
neighborhood_feFox Chase                    -0.309             0.757448    
neighborhood_feFrancisville                 -2.793             0.005230 ** 
neighborhood_feFrankford                    -3.604             0.000314 ***
neighborhood_feFranklinville                -4.438  0.00000915089884873 ***
neighborhood_feGermantown - Morton          -2.742             0.006109 ** 
neighborhood_feGermantown - Penn Knox       -3.280             0.001040 ** 
neighborhood_feGermantown - Westside        -2.665             0.007715 ** 
neighborhood_feGermany Hill                  0.221             0.825091    
neighborhood_feGirard Estates               -1.754             0.079453 .  
neighborhood_feGlenwood                     -4.232  0.00002332478971586 ***
neighborhood_feGraduate Hospital             4.780  0.00000177298354943 ***
neighborhood_feGrays Ferry                  -4.541  0.00000564776099513 ***
neighborhood_feGreenwich                    -2.632             0.008499 ** 
neighborhood_feHaddington                   -4.947  0.00000076216522673 ***
neighborhood_feHarrowgate                   -3.121             0.001808 ** 
neighborhood_feHartranft                    -5.808  0.00000000645831556 ***
neighborhood_feHawthorne                     0.480             0.631414    
neighborhood_feHolmesburg                   -1.096             0.272927    
neighborhood_feHunting Park                 -2.650             0.008059 ** 
neighborhood_feJuniata Park                 -3.032             0.002432 ** 
neighborhood_feKingsessing                  -6.064  0.00000000136233459 ***
neighborhood_feLawndale                     -2.512             0.012000 *  
neighborhood_feLexington Park                0.621             0.534305    
neighborhood_feLogan                        -5.134  0.00000028814016953 ***
neighborhood_feLogan Square                 12.587 < 0.0000000000000002 ***
neighborhood_feLower Moyamensing            -3.875             0.000107 ***
neighborhood_feManayunk                      1.289             0.197556    
neighborhood_feMantua                       -3.418             0.000633 ***
neighborhood_feMayfair                      -0.925             0.355206    
neighborhood_feMelrose Park Gardens         -2.281             0.022586 *  
neighborhood_feMill Creek                   -4.171  0.00003052142868642 ***
neighborhood_feMillbrook                     0.486             0.627265    
neighborhood_feModena                        1.785             0.074324 .  
neighborhood_feMorrell Park                  0.707             0.479411    
neighborhood_feNewbold                      -2.601             0.009307 ** 
neighborhood_feNicetown                     -1.784             0.074504 .  
neighborhood_feNormandy Village              2.998             0.002725 ** 
neighborhood_feNorth Central                -4.833  0.00000135864190429 ***
neighborhood_feNorthern Liberties            0.025             0.979948    
neighborhood_feNorthwood                    -4.446  0.00000880077146772 ***
neighborhood_feOgontz                       -2.127             0.033452 *  
neighborhood_feOld City                      9.808 < 0.0000000000000002 ***
neighborhood_feOld Kensington               -2.433             0.014970 *  
neighborhood_feOlney                        -4.298  0.00001733699544840 ***
neighborhood_feOverbrook                    -5.699  0.00000001231809258 ***
neighborhood_feOxford Circle                -1.589             0.112183    
neighborhood_fePacker Park                   0.142             0.886827    
neighborhood_feParkwood Manor                1.578             0.114513    
neighborhood_fePaschall                     -3.719             0.000201 ***
neighborhood_fePassyunk Square               0.302             0.762850    
neighborhood_fePennsport                    -1.757             0.078980 .  
neighborhood_fePennypack                    -0.190             0.849402    
neighborhood_fePennypack Woods               0.112             0.910524    
neighborhood_fePenrose                      -2.065             0.038985 *  
neighborhood_fePoint Breeze                 -3.713             0.000206 ***
neighborhood_feQueen Village                 5.114  0.00000031986519421 ***
neighborhood_feRhawnhurst                    0.338             0.735228    
neighborhood_feRichmond                     -2.046             0.040804 *  
neighborhood_feRittenhouse                  16.520 < 0.0000000000000002 ***
neighborhood_feRoxborough                    0.364             0.715505    
neighborhood_feRoxborough Park              -1.180             0.238045    
neighborhood_feSharswood                    -2.439             0.014755 *  
neighborhood_feSociety Hill                 13.487 < 0.0000000000000002 ***
neighborhood_feSomerton                      1.787             0.073888 .  
neighborhood_feSouthwest Germantown         -4.257  0.00002084666647520 ***
neighborhood_feSouthwest Schuylkill         -4.420  0.00000993533104278 ***
neighborhood_feSpring Garden                 6.855  0.00000000000742997 ***
neighborhood_feSpruce Hill                   4.306  0.00001676857016520 ***
neighborhood_feStadium District             -1.214             0.224582    
neighborhood_feStanton                      -6.686  0.00000000002380541 ***
neighborhood_feStrawberry Mansion           -5.043  0.00000046342011412 ***
neighborhood_feSummerdale                   -2.093             0.036348 *  
neighborhood_feTacony                       -2.201             0.027752 *  
neighborhood_feTioga                        -4.959  0.00000071819032233 ***
neighborhood_feTorresdale                   -1.083             0.278887    
neighborhood_feUpper Kensington             -4.865  0.00000115505845040 ***
neighborhood_feUpper Roxborough             -1.789             0.073668 .  
neighborhood_feWalnut Hill                  -2.107             0.035140 *  
neighborhood_feWashington Square West        3.030             0.002453 ** 
neighborhood_feWest Central Germantown       0.307             0.759213    
neighborhood_feWest Kensington              -3.549             0.000388 ***
neighborhood_feWest Mount Airy               1.686             0.091737 .  
neighborhood_feWest Oak Lane                -2.795             0.005199 ** 
neighborhood_feWest Passyunk                -4.086  0.00004412082867494 ***
neighborhood_feWest Poplar                  -2.623             0.008722 ** 
neighborhood_feWest Powelton                -3.046             0.002325 ** 
neighborhood_feWhitman                      -2.508             0.012166 *  
neighborhood_feWinchester Park               0.968             0.332896    
neighborhood_feWissahickon                  -0.067             0.946784    
neighborhood_feWissahickon Hills             0.263             0.792708    
neighborhood_feWissinoming                  -1.461             0.144104    
neighborhood_feWister                       -2.453             0.014174 *  
neighborhood_feWynnefield                   -5.252  0.00000015267050722 ***
neighborhood_feSmall Neighborhoods          -1.657             0.097477 .  
quarters_fe2                                 1.041             0.297944    
quarters_fe3                                 1.382             0.166997    
quarters_fe4                                 1.678             0.093463 .  
city_dist_mi:knn_amenity_mi                  0.598             0.549913    
city_dist_mi:septa_half_mi                  -7.887  0.00000000000000333 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 137700 on 13638 degrees of freedom
Multiple R-squared:  0.7523,    Adjusted R-squared:  0.7494 
F-statistic: 260.4 on 159 and 13638 DF,  p-value: < 0.00000000000000022
# Step model to check for most influential predictors.
step_model <- step(final_model, direction = "both")
Start:  AIC=326698.2
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + fuel_type + age + 
    I(age^2) + medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + knn_amenity_mi * 
    city_dist_mi + septa_half_mi * city_dist_mi + neighborhood_fe + 
    quarters_fe

                               Df      Sum of Sq             RSS    AIC
- fuel_type                     3    31456529353 258630588679851 326694
- quarters_fe                   3    59495473287 258658627623784 326695
- city_dist_mi:knn_amenity_mi   1     6778614569 258605910765067 326697
- pct_single_family             1    17988300182 258617120450680 326697
- ln_park_dist                  1    31000772084 258630132922582 326698
- story_num                     1    35552177927 258634684328425 326698
<none>                                           258599132150498 326698
- pct_vacant                    1   202953511790 258802085662288 326707
- medhhinc                      1   969006856383 259568139006880 326748
- I(age^2)                      1  1105298342711 259704430493209 326755
- city_dist_mi:septa_half_mi    1  1179449642410 259778581792908 326759
- age                           1  2351993067410 260951125217908 326821
- ac_binary                     1  3054904480424 261654036630922 326858
- basement_type                10  5034519809658 263633651960156 326944
- bath_num                      1  6813284658901 265412416809399 327055
- garage_num                    1  9726291631783 268325423782280 327206
- fireplace_num                 1 10259617935047 268858750085545 327233
- ln_square_feet                1 41651714635345 300250846785843 328757
- neighborhood_fe             126 54332171013126 312931303163624 329078

Step:  AIC=326693.9
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    quarters_fe + city_dist_mi:knn_amenity_mi + city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- quarters_fe                   3    58251483410 258688840163261 326691
- city_dist_mi:knn_amenity_mi   1     7006089980 258637594769831 326692
- pct_single_family             1    17662065992 258648250745843 326693
- ln_park_dist                  1    31376577429 258661965257280 326694
- story_num                     1    35089819593 258665678499444 326694
<none>                                           258630588679851 326694
+ fuel_type                     3    31456529353 258599132150498 326698
- pct_vacant                    1   207734414580 258838323094431 326703
- medhhinc                      1   973601226853 259604189906704 326744
- I(age^2)                      1  1104835124027 259735423803878 326751
- city_dist_mi:septa_half_mi    1  1183213037125 259813801716976 326755
- age                           1  2347373601693 260977962281544 326817
- ac_binary                     1  3122005541691 261752594221542 326857
- basement_type                10  5058896467187 263689485147038 326941
- bath_num                      1  6794283429119 265424872108970 327050
- garage_num                    1  9746557921518 268377146601369 327202
- fireplace_num                 1 10300026471864 268930615151715 327231
- ln_square_feet                1 41871994839099 300502583518950 328762
- neighborhood_fe             126 54367116497342 312997705177193 329074

Step:  AIC=326691
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    city_dist_mi:knn_amenity_mi + city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- city_dist_mi:knn_amenity_mi   1     6992669269 258695832832530 326689
- pct_single_family             1    18160703601 258707000866862 326690
- ln_park_dist                  1    31054476238 258719894639499 326691
- story_num                     1    35225587524 258724065750784 326691
<none>                                           258688840163261 326691
+ quarters_fe                   3    58251483410 258630588679851 326694
+ fuel_type                     3    30212539476 258658627623784 326695
- pct_vacant                    1   209205012301 258898045175561 326700
- medhhinc                      1   976484533746 259665324697007 326741
- I(age^2)                      1  1105689878507 259794530041768 326748
- city_dist_mi:septa_half_mi    1  1187889628748 259876729792009 326752
- age                           1  2351142680862 261039982844123 326814
- ac_binary                     1  3124085436708 261812925599968 326855
- basement_type                10  5041751033966 263730591197227 326937
- bath_num                      1  6792334607719 265481174770979 327047
- garage_num                    1  9730464404128 268419304567389 327198
- fireplace_num                 1 10300411046192 268989251209453 327228
- ln_square_feet                1 41866233468651 300555073631912 328759
- neighborhood_fe             126 54377046516611 313065886679872 329071

Step:  AIC=326689.3
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- knn_amenity_mi                1     7277984981 258703110817511 326688
- pct_single_family             1    17665464192 258713498296721 326688
- ln_park_dist                  1    30859841089 258726692673619 326689
- story_num                     1    33835482364 258729668314894 326689
<none>                                           258695832832530 326689
+ city_dist_mi:knn_amenity_mi   1     6992669269 258688840163261 326691
+ quarters_fe                   3    58238062699 258637594769831 326692
+ fuel_type                     3    30438931045 258665393901485 326694
- pct_vacant                    1   205830799764 258901663632294 326698
- medhhinc                      1  1024946067777 259720778900307 326742
- I(age^2)                      1  1107226451098 259803059283628 326746
- city_dist_mi:septa_half_mi    1  1369443215549 260065276048078 326760
- age                           1  2358849475070 261054682307599 326813
- ac_binary                     1  3126535824236 261822368656765 326853
- basement_type                10  5041056151592 263736888984122 326936
- bath_num                      1  6785716359645 265481549192175 327045
- garage_num                    1  9723962673082 268419795505611 327196
- fireplace_num                 1 10293591997457 268989424829987 327226
- ln_square_feet                1 41879022995694 300574855828224 328758
- neighborhood_fe             126 54462991804011 313158824636541 329074

Step:  AIC=326687.7
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- pct_single_family            1    16023323571 258719134141082 326687
- ln_park_dist                 1    32576166511 258735686984022 326687
- story_num                    1    33453235709 258736564053220 326688
<none>                                          258703110817511 326688
+ knn_amenity_mi               1     7277984982 258695832832530 326689
+ quarters_fe                  3    58637152574 258644473664937 326691
+ fuel_type                    3    30469203352 258672641614159 326692
- pct_vacant                   1   205400357025 258908511174536 326697
- medhhinc                     1  1050627342814 259753738160325 326742
- I(age^2)                     1  1105734593422 259808845410933 326745
- city_dist_mi:septa_half_mi   1  1372088231491 260075199049002 326759
- age                          1  2357403484032 261060514301543 326811
- ac_binary                    1  3124242419909 261827353237420 326851
- basement_type               10  5034914580427 263738025397938 326934
- bath_num                     1  6786088355475 265489199172986 327043
- garage_num                   1  9717658792378 268420769609889 327195
- fireplace_num                1 10286675978556 268989786796067 327224
- ln_square_feet               1 41891226774696 300594337592207 328757
- neighborhood_fe            126 54765308921607 313468419739118 329085

Step:  AIC=326686.6
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + city_dist_mi + septa_half_mi + ln_park_dist + 
    neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- ln_park_dist                 1    34260207544 258753394348626 326686
- story_num                    1    35350325829 258754484466911 326686
<none>                                          258719134141082 326687
+ pct_single_family            1    16023323571 258703110817511 326688
+ knn_amenity_mi               1     5635844360 258713498296721 326688
+ quarters_fe                  3    59060739015 258660073402067 326689
+ fuel_type                    3    30124087256 258689010053826 326691
- pct_vacant                   1   203295170131 258922429311213 326695
- medhhinc                     1  1103565542391 259822699683472 326743
- I(age^2)                     1  1107375063863 259826509204945 326744
- city_dist_mi:septa_half_mi   1  1406466775628 260125600916710 326759
- age                          1  2357177538578 261076311679660 326810
- ac_binary                    1  3133523723037 261852657864119 326851
- basement_type               10  5095850020133 263814984161215 326936
- bath_num                     1  6797267159736 265516401300818 327042
- garage_num                   1  9833628399885 268552762540967 327199
- fireplace_num                1 10369199385128 269088333526210 327227
- ln_square_feet               1 42079169867674 300798304008756 328764
- neighborhood_fe            126 54749566200936 313468700342018 329083

Step:  AIC=326686.4
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + city_dist_mi + septa_half_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- story_num                    1    36783377486 258790177726112 326686
<none>                                          258753394348626 326686
+ ln_park_dist                 1    34260207544 258719134141082 326687
+ pct_single_family            1    17707364604 258735686984022 326687
+ knn_amenity_mi               1     7111852219 258746282496407 326688
+ quarters_fe                  3    58786289980 258694608058646 326689
+ fuel_type                    3    30487971848 258722906376778 326691
- pct_vacant                   1   213372544155 258966766892781 326696
- medhhinc                     1  1097276213456 259850670562082 326743
- I(age^2)                     1  1103534718154 259856929066780 326743
- city_dist_mi:septa_half_mi   1  1462427258435 260215821607061 326762
- age                          1  2349465632033 261102859980658 326809
- ac_binary                    1  3111345571022 261864739919648 326849
- basement_type               10  5092954356188 263846348704814 326935
- bath_num                     1  6794485517018 265547879865644 327042
- garage_num                   1  9820949223654 268574343572279 327198
- fireplace_num                1 10343512477155 269096906825781 327225
- ln_square_feet               1 42143077000556 300896471349182 328766
- neighborhood_fe            126 55955492133829 314708886482455 329136

Step:  AIC=326686.4
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    garage_num + basement_type + age + I(age^2) + medhhinc + 
    pct_vacant + city_dist_mi + septa_half_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
<none>                                          258790177726112 326686
+ story_num                    1    36783377486 258753394348626 326686
+ ln_park_dist                 1    35693259201 258754484466911 326686
+ pct_single_family            1    19775699983 258770402026129 326687
+ knn_amenity_mi               1     6651995539 258783525730573 326688
+ quarters_fe                  3    58918963475 258731258762637 326689
+ fuel_type                    3    29963731347 258760213994765 326691
- pct_vacant                   1   210445260804 259000622986916 326696
- I(age^2)                     1  1084785932414 259874963658526 326742
- medhhinc                     1  1092749935202 259882927661314 326743
- city_dist_mi:septa_half_mi   1  1485717762749 260275895488861 326763
- age                          1  2344717333626 261134895059738 326809
- ac_binary                    1  3101501770070 261891679496182 326849
- basement_type               10  5058612974825 263848790700937 326933
- bath_num                     1  6772868453985 265563046180097 327041
- garage_num                   1 10040789049779 268830966775891 327210
- fireplace_num                1 10386328629071 269176506355183 327227
- ln_square_feet               1 47795719610713 306585897336824 329023
- neighborhood_fe            126 55919556935705 314709734661817 329134
# Compute TSS
y <- model.response(model.frame(step_model))
tss <- sum((y - mean(y))^2)

# Sequential (Type I) Sum Sq → ΔR²
anova_model <- anova(step_model)
anova_model <- anova_model[!is.na(anova_model$"Sum Sq"), , drop = FALSE]
seq_imp <- transform(anova_model,
                     term = rownames(anova_model),
                     delta_R2 = `Sum Sq` / tss)

# Get top 5 predictors
seq_top4 <- seq_imp[order(-seq_imp$delta_R2), c("term", "Df", "delta_R2")]
head(seq_top4, 5)
                           term    Df   delta_R2
ln_square_feet   ln_square_feet     1 0.41288131
Residuals             Residuals 13649 0.24792516
neighborhood_fe neighborhood_fe   126 0.06691366
medhhinc               medhhinc     1 0.06230654
bath_num               bath_num     1 0.04956400
# Residual map preparation

# Match CRS
philly_neighborhoods <- st_transform(philly_neighborhoods, st_crs(final_data))

# Spatial join: assign each point to a neighborhood
points_with_neighborhood <- st_join(final_data, philly_neighborhoods)

# Add residual column
points_with_neighborhood$residuals <- residuals(final_model)

# Calculate average residual by neighborhood
neighborhood_residuals <- points_with_neighborhood %>%
  st_drop_geometry() %>%
  group_by(MAPNAME) %>%
  summarise(
    mean_residual = mean(residuals, na.rm = TRUE),
    median_residual = median(residuals, na.rm = TRUE),
    n_sales = n()
  )

# Join to neighborhoods
neighborhoods_with_residuals <- philly_neighborhoods %>%
  left_join(neighborhood_residuals, by = "MAPNAME")
# Map the averaged residuals

neighborhood_map <- ggplot(neighborhoods_with_residuals) +
  geom_sf(aes(fill = mean_residual), color = "black", size = 0.3) +
  scale_fill_gradient2(
    low = "blue2", 
    mid = "#f5f4f0", 
    high = "red2",
    midpoint = 0,
    labels = scales::dollar,
    name = "Mean Residual ($)"
  ) +
  theme_minimal() +
  labs(
    title = "Average Model Residuals by Neighborhood",
    subtitle = "Red = Under-Predicted | Blue = Over-Predicted"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

neighborhood_map

#ggsave("slide_images/residual-neighborhood.png", plot = neighborhood_map, width = 8, height = 6, units = "in", dpi = 300)

University City is the hardest to predict, Penn owns a lot of property that doesn’t get taxed. And some less wealthy and disinvested neighborhoods are overvalued, like Parkside and Wynnefield in West Philadelphia, but the overall model predicts pretty accurately for most neighborhoods in Philadelphia.

Cross-Validation

10-Fold Cross-Validation

Logged Price Response Variable

library(caret)
# 10-fold cross-validation
# first model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_log_1 <- train(ln_sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_log_1$results
  intercept      RMSE  Rsquared       MAE      RMSESD RsquaredSD       MAESD
1      TRUE 0.3815519 0.6099798 0.2896351 0.008154798 0.01328205 0.006378651
library(caret)
# 10-fold cross-validation
# second model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_log_2 <- train(ln_sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # census
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_log_2$results
  intercept      RMSE  Rsquared       MAE     RMSESD RsquaredSD       MAESD
1      TRUE 0.3100579 0.7422833 0.2297654 0.01170645 0.01500042 0.007152849
library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_log_3 <- train(ln_sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_log_3$results
  intercept     RMSE  Rsquared       MAE      RMSESD RsquaredSD       MAESD
1      TRUE 0.295273 0.7664171 0.2176611 0.009431516  0.0108957 0.005275585
library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_log_4 <- train(ln_sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe, # Fixed effect  
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_log_4$results
  intercept      RMSE  Rsquared       MAE      RMSESD RsquaredSD       MAESD
1      TRUE 0.2550054 0.8256555 0.1812691 0.009637928 0.01206398 0.004562766
# Compare all 4 models

## Combine four models:cv_model_1, cv_model_2, cv_model_3, cv_model_4
log_compare <- bind_rows(
  cv_log_1$results %>% 
    mutate(Model = "Model 1: Structural"),
  cv_log_2$results %>% 
    mutate(Model = "Model 2: Structural + Census"),
  cv_log_3$results %>% 
    mutate(Model = "Model 3: Structural + Census + Spatial"),
  cv_log_4$results %>% 
    mutate(Model = "Model 4: Final Model")
) %>%
  select(Model, RMSE, Rsquared, MAE) %>% 
  mutate(across(c(RMSE, Rsquared, MAE), round, 3))

## Plot
log_kable <- kable(log_compare,
                     col.names = c("Model", "RMSE", "R²", "MAE"),
                     caption = "Log Model Performance Comparison (10-Fold Cross-Validation)",
                     digits = 4,
                     format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped", full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "10cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100")

log_kable
Log Model Performance Comparison (10-Fold Cross-Validation)
Model RMSE MAE
Model 1: Structural 0.382 0.610 0.290
Model 2: Structural + Census 0.310 0.742 0.230
Model 3: Structural + Census + Spatial 0.295 0.766 0.218
Model 4: Final Model 0.255 0.826 0.181

Unlogged Price Response Variable

library(caret)
# 10-fold cross-validation
# first model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_1 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_1$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 175423.7 0.5911972 102109.1 18603.98 0.06339336 3350.257
library(caret)
# 10-fold cross-validation
# second model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_2 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # census
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_2$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 161361.7 0.6544926 88453.36 19993.33   0.061163 3491.685
library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_3 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_3$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 154373.3 0.6841768 84570.17 19329.05 0.05523242 2730.257
library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_4 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe, # Fixed effect  
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_4$results
  intercept     RMSE  Rsquared     MAE   RMSESD RsquaredSD    MAESD
1      TRUE 138562.1 0.7452823 72130.3 18854.11  0.0510009 1576.612
# Compare all 4 models

## Combine four models:cv_model_1, cv_model_2, cv_model_3, cv_model_4
model_compare <- bind_rows(
  cv_model_1$results %>% 
    mutate(Model = "Model 1: Structural"),
  cv_model_2$results %>% 
    mutate(Model = "Model 2: Structural + Census"),
  cv_model_3$results %>% 
    mutate(Model = "Model 3: Structural + Census + Spatial"),
  cv_model_4$results %>% 
    mutate(Model = "Model 4: Final Model")
) %>%
  select(Model, RMSE, Rsquared, MAE) %>% 
  mutate(across(c(RMSE, Rsquared, MAE), round, 3))

## Plot
model_kable <- kable(model_compare,
                     col.names = c("Model", "RMSE ($)", "R²", "MAE ($)"),
                     caption = "Model Performance Comparison (10-Fold Cross-Validation)",
                     digits = 4,
                     format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped", full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "10cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100")

model_kable
Model Performance Comparison (10-Fold Cross-Validation)
Model RMSE ($) MAE ($)
Model 1: Structural 175,423.7 0.591 102,109.10
Model 2: Structural + Census 161,361.7 0.654 88,453.36
Model 3: Structural + Census + Spatial 154,373.3 0.684 84,570.17
Model 4: Final Model 138,562.1 0.745 72,130.30
# Create predicted vs. actual plot

pred_df <- cv_model_4$pred

# Plot Predicted vs Actual 
pred_v_act <- ggplot(pred_df, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.5, color = "#2C7BB6") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Predicted vs. Actual Sale Price",
    x = "Actual Sale Price ($)",
    y = "Predicted Sale Price ($)"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

pred_v_act

Model Diagnostics

Check Assumptions

resid_df <- data.frame(
  fitted = fitted(final_model),
  residuals = resid(final_model)
)

resid_v_fit <- ggplot(resid_df, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#2C7BB6") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. Fitted Values",
    x = "Fitted Values (Predicted Sale Price)",
    y = "Residuals"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

resid_v_fit

Residuals vs Fitted Values:

The residual plot shows that most residuals are centered around zero, but the spread of residuals increases as the fitted values grow. This “funnel-shaped” pattern suggests potential heteroskedasticity, meaning the variance of errors may increase for higher-priced properties. While the overall linearity assumption appears reasonable, the increasing dispersion indicates that the model’s prediction error is not constant across the price range.

resid_df <- data.frame(residuals = residuals(final_model))

# Q-Q Plot
qq_plot <- ggplot(resid_df, aes(sample = residuals)) +
  stat_qq(color = "#2C7BB6", alpha = 0.6, size = 2) +      
  stat_qq_line(color = "red", linetype = "dashed") +        
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Check for normality assumption.",
    x = "Theoretical Quantiles (Normal Distribution)",
    y = "Sample Quantiles (Residuals)"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

qq_plot

Normal Q–Q Plot:

The Q–Q plot reveals that the residuals deviate from the reference line in both tails, especially in the upper tail. This pattern indicates that the residuals are right-skewed and not perfectly normally distributed. The deviation is mainly driven by a small number of very high sale-price observations, which pull the residual distribution upward. However, moderate departures from normality are common in housing price data and generally do not invalidate the model.

cd <- cooks.distance(final_model)

used_row_idx <- as.integer(rownames(model.frame(final_model)))

cooks_df <- tibble(
  row_in_data  = used_row_idx,           
  row_in_model = seq_along(cd),          
  cooks_distance = as.numeric(cd)
)

n_used <- length(cd)
threshold <- 4 / n_used

cooks_plot <- ggplot(cooks_df, aes(x = row_in_model, y = cooks_distance)) +
  geom_bar(stat = "identity", fill = "#2C7BB6", alpha = 0.6) +
  geom_hline(yintercept = threshold, color = "red", linetype = "dashed") +
  coord_cartesian(ylim = c(0, 0.02)) +  
  labs(
    title = "Cook's Distance (Zoomed In)",
    subtitle = paste0("Red Dashed Line = 4/n ≈ ", round(threshold, 5)),
    x = "Observation Index (In-Model)",
    y = "Cook's Distance"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

cooks_plot

# Most influential
top_influential <- cooks_df %>%
  filter(cooks_distance > threshold) %>%
  arrange(desc(cooks_distance)) %>%
  slice_head(n = 10) %>%
  mutate(
    sale_price = final_data$sale_price[row_in_data]   
  ) %>%
  select(row_in_model, row_in_data, sale_price, cooks_distance)

top_influential
# A tibble: 10 × 4
   row_in_model row_in_data sale_price cooks_distance
          <int>       <int>      <dbl>          <dbl>
 1         6150        6150    3600000         0.199 
 2         6437        6437    5477901         0.130 
 3          249         249    3330400         0.0464
 4         7966        7967     601004         0.0451
 5         2537        2537    3995000         0.0397
 6        11379       11380     281000         0.0358
 7         7649        7650     330000         0.0326
 8         2636        2636    3000000         0.0285
 9         1975        1975     170000         0.0240
10         5466        5466    3850000         0.0232

Cook’s Distance:

The Cook’s distance plot shows that almost all observations have very small influence values (below the 4/n threshold), indicating that the model is not dominated by a few extreme points. A few cases exhibit slightly higher Cook’s D values, suggesting the presence of mildly influential outliers, but none appear to exert excessive leverage on the regression coefficients.

Detailed Discussion

Our final model achieves a cross-validated R² of 0.825, explaining approximately 83% of the variance in Philadelphia residential sale prices. The Mean Absolute Error (MAE) of 72,327.67 USD indicates that, on average, predicted sale prices deviate from actual prices by roughly 17% of the median home price. However, the Root Mean Squared Error (RMSE) of 137,218.70USD—nearly double the MAE—reveals that the model struggles disproportionately with high-value properties. This discrepancy, combined with residuals ranging from about -800,000 USD to +5 million USD, reflects the outsized influence of luxury homes on overall error. Diagnostic plots confirm these patterns: the Q-Q plot shows deviation from normality in both tails (especially the upper tail), while the residuals vs. fitted values plot exhibits a funnel-shaped pattern indicating heteroskedasticity—prediction error variance increases systematically for higher-priced properties. Despite these issues, the median residual of 2,339 USD (near zero) suggests the model’s predictions are generally unbiased for typical homes, and Cook’s distance values remain well below concerning thresholds, indicating no single observation dominates the model.

The model’s minimum and maximum residuals range from about -800,000 USD to +5 million USD, reflecting the outlier influence from luxury properties. While simultaneously referencing the QQ Plot, the model reflects two tails in the plot, but more in the positive region, indicating the model is not 100% normal. However, the residual distribution with the median residual of ~2,300 USD is somewhat marginal in the broader aspect, meaning the model predictions are generally centered about sale prices with limited bias in sale prediction. It is imperative to keep in mind that in the residuals vs fitted values plot, the increase of observations fanning-out as the predicted sale price value increases indicates the existence of heteroskedasticity between some variables.

We have concluded that ln_square_feet, neighborhood fixed effects, median household income, number of bathrooms and number of fireplaces as the most significant variables in our model. We calculated this by doing a stepwise regression, and these five yield the highest \(R^2\).

We grossly underpredicted for University City, and a decent amount for Fairmount. This model struggles to accurately predict rural areas and overpredicted uninvested neighborhoods. The severe underprediction of University City may be due to the existence and proximity to Drexel University and the University of Pennsylvania.

External Resources

Downloads

Relevant Readings

Artificial Intelligence

  • Claude: For code debugging in data cleaning and visualizations.

  • Chat GPT: For code debugging in data cleaning and visualizations.